68 research outputs found

    Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability

    Heterogeneous computing, which combines devices with different architectures, is rising in popularity and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programming such systems and offers functional portability. It does, however, suffer from poor performance portability: code tuned for one device must be re-tuned to achieve good performance on another device. In this paper, we use machine learning-based auto-tuning to address this problem. Benchmarks are run on a random subset of the entire tuning parameter configuration space, and the results are used to build an artificial neural network based model. The model can then be used to find interesting parts of the parameter space for further search. We evaluate our method with different benchmarks, on several devices, including an Intel i7 3770 CPU, an Nvidia K40 GPU and an AMD Radeon HD 7970 GPU. Our model achieves a mean relative error as low as 6.1%, and is able to find configurations as little as 1.3% worse than the global minimum. Comment: This is a pre-print version of an article to be published in the Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). For personal use onl
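
    The sample, model, and search loop described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `runtime` function is a hypothetical synthetic stand-in for running a real benchmark, the two tuning parameters are invented, and a k-nearest-neighbour average is swapped in for the paper's artificial neural network to keep the example dependency-free.

```python
import random
from itertools import product


def runtime(config):
    """Hypothetical stand-in for a benchmark run; returns a synthetic time."""
    work_group, unroll = config
    return (work_group - 64) ** 2 / 64.0 + (unroll - 4) ** 2 + 10.0


def predict(config, samples, k=3):
    """k-nearest-neighbour surrogate (assumption; the paper uses a neural network)."""
    nearest = sorted(
        samples,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], config)),
    )[:k]
    return sum(t for _, t in nearest) / len(nearest)


def autotune(space, n_samples, n_candidates, seed=0):
    rng = random.Random(seed)
    # Stage 1: measure a random subset of the configuration space.
    samples = [(c, runtime(c)) for c in rng.sample(space, n_samples)]
    # Stage 2: rank every configuration by its predicted runtime.
    ranked = sorted(space, key=lambda c: predict(c, samples))
    # Stage 3: measure only the most promising candidates.
    best = min(ranked[:n_candidates], key=runtime)
    return best, runtime(best)


# Invented parameter space: work-group size x unroll factor.
space = list(product([16, 32, 64, 128, 256], [1, 2, 4, 8]))
best, best_time = autotune(space, n_samples=8, n_candidates=5)
print(best, best_time)
```

    Only 8 + 5 of the 20 configurations are measured here; in a realistic space the measured fraction would be far smaller, which is where a model-guided search is expected to pay off.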

    Multicore Max-Flow using GraphBLAS: A Usability Study

    Optimizing linear algebra operations has been a research topic for decades. The compact language of mathematics also produces lean, maintainable code. Using linear algebra as a high-level abstraction for graph operations is therefore very attractive. In this work, we explore the usability of the GraphBLAS framework, currently the leading standard for graph operations that uses linear algebra as an abstraction. We analyze the usability of GraphBLAS by using it to implement the Edmonds-Karp algorithm for s-t maximum-flow/minimum-cut. To our knowledge, this work represents the first published results of Max-Flow in GraphBLAS. The result of our novel implementation was an algorithm that achieved a speedup of up to 11x over its own baseline, and is surprisingly compact and easy to reason about.
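
    For orientation, the Edmonds-Karp algorithm named in this abstract can be sketched in plain Python on a dense capacity matrix. This is not the paper's GraphBLAS code and uses no GraphBLAS calls; the function names are illustrative. The comment in the BFS marks the step that a linear-algebra formulation would typically express as a masked vector-matrix product.

```python
def bfs_parents(residual, source, sink):
    """Breadth-first search over edges with remaining capacity."""
    n = len(residual)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    while frontier and parent[sink] == -1:
        next_frontier = []
        # One BFS level. In linear-algebra terms this corresponds to
        # multiplying the frontier vector by the residual adjacency
        # matrix and masking out vertices that were already visited.
        for u in frontier:
            for v in range(n):
                if residual[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent


def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly augment along a shortest residual path."""
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        parent = bfs_parents(residual, source, sink)
        if parent[sink] == -1:
            return flow  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow: reduce forward edges, grow reverse edges.
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Small example network: vertex 0 is the source, vertex 5 the sink.
capacity = [
    [0, 16, 13, 0, 0, 0],
    [0, 0, 0, 12, 0, 0],
    [0, 4, 0, 0, 14, 0],
    [0, 0, 9, 0, 0, 20],
    [0, 0, 0, 7, 0, 4],
    [0, 0, 0, 0, 0, 0],
]
result = max_flow(capacity, 0, 5)
print(result)
```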

    Autotuning CUDA: Applying NLP Techniques to LS-CAT

    The abstract relation between hardware parameters and program performance makes setting program parameters a difficult task. Without autotuning, software can miss low-level optimizations, resulting in lower performance. Traditionally, time-consuming trial and error search methods have been the staple of autotuning. Applying natural language processing (NLP) based machine learning (ML) methods to source code as a means to perform autotuning-oriented tasks is a growing topic. Earlier research has, with success, performed a range of different autotuning tasks using multiple source code languages. However, most of the source code data is CPU-oriented, with very little GPU code. The LS-CAT (Large-Scale CUDA AutoTuning) dataset [BTE21] uses CUDA GPU-based kernels and generates a dataset to perform thread-coarsening. This paper implements several custom NLP-ML pipelines to evaluate ML-based thread-coarsening using the LS-CAT dataset, and a custom scoring function to find the performance impact for any choice. Several model configurations were able to beat both random choice, 0.9400, and only selecting the largest thread-block (1024), 0.9437. Finally, the best model achieves a score of 0.9483, giving an average performance increase and speedup of 0.49 percent over the largest thread-block. Implementing self-attention mechanisms proved to counteract overfitting, while a multi-label based learning task outperformed other approaches. Compared to previous datasets [Cum+17], the LS-CAT dataset's higher thread-coarsening precision gives a more precise evaluation of the model's performance. The inst2vec embedding used in earlier works was unable to correctly parse the CUDA LLVM IR tokens, resulting in high data loss. Approaches to addressing this, and other ideas for future work, are also included.

    Quasi Spin Images

    The increasing adoption of 3D capturing equipment, now also found in mobile devices, means that 3D content is increasingly prevalent. Common operations on such data, including 3D object recognition and retrieval, are based on the measurement of similarity between 3D objects. A common way to measure object similarity is through local shape descriptors, which aim to do part-to-part matching by describing portions of an object's shape. The Spin Image is one of the local descriptors most suitable for use in scenes with high degrees of clutter and occlusion, but its practical use has been hampered by high computational demands. The rise in processing power of the GPU represents an opportunity to significantly improve the generation and comparison performance of descriptors, such as the Spin Image, thereby increasing the practical applicability of methods making use of it. In this paper we introduce a GPU-based Quasi Spin Image (QSI) algorithm, a variation of the original Spin Image, and show that a speedup of an order of magnitude relative to a reference CPU implementation can be achieved in terms of the image generation rate. In addition, the QSI is noise free, can be computed consistently, and a preliminary evaluation shows it correlates well with the original Spin Image.
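
    As background, the original Spin Image descriptor (not the QSI variation this paper introduces) can be sketched as a 2D histogram around an oriented point. Each neighbouring point is mapped to its radial distance from the normal axis (alpha) and its signed height along the normal (beta). The binning layout, parameter names, and the use of plain counting instead of interpolation are simplifying assumptions made for this illustration.

```python
import math


def spin_image(points, p, n, bin_size, width):
    """Accumulate a width x width spin image around point p with unit normal n."""
    image = [[0] * width for _ in range(width)]
    half_height = width * bin_size / 2.0  # rows cover beta in [-h, +h]
    for x in points:
        d = [xi - pi for xi, pi in zip(x, p)]
        # beta: signed distance along the normal.
        beta = sum(di * ni for di, ni in zip(d, n))
        # alpha: distance from the line through p along n.
        alpha = math.sqrt(max(0.0, sum(di * di for di in d) - beta * beta))
        col = int(alpha // bin_size)
        row = int((half_height - beta) // bin_size)
        if 0 <= row < width and 0 <= col < width:
            image[row][col] += 1
    return image


points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.5)]
image = spin_image(points, p=(0.0, 0.0, 0.0), n=(0.0, 0.0, 1.0),
                   bin_size=1.0, width=4)
for row in image:
    print(row)
```

    Because every surface point contributes independently to the histogram, this accumulation is the kind of workload that maps naturally onto a GPU.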

    Linear optimization on modern GPUs

    Optimization algorithms are becoming increasingly important in many areas, such as finance and engineering. Typically, real problems involve several hundreds of variables, and are subject to as many constraints. Several methods have been developed to reduce the theoretical time complexity. Nevertheless, when problems exceed reasonable sizes they end up being very computationally intensive. Heterogeneous systems composed by coupling commodity CPUs and GPUs are becoming relatively cheap, highly performing systems. Recent developments of GPGPU technologies give even more powerful control over them. In this paper, we show how we use the revised simplex algorithm for solving linear programming problems, originally described by Dantzig, in both our CPU and GPU implementations. Previously, this approach has been shown not to scale beyond around 200 variables. However, by taking advantage of modern libraries such as ATLAS for matrix-matrix multiplication, and the NVIDIA CUDA programming library on recent GPUs, we show that we can scale to problem sizes up to at least 2000 variables in our experiments for both architectures. On the GPU, we also achieve an appreciable precision on large problems with thousands of variables and constraints while achieving between 2X and 2.5X speed-ups over the serial ATLAS-based CPU version. With further tuning of both the algorithm and its implementations, even better results should be achievable for both the CPU and GPU versions.

    Parallel Simulation of Probabilistic P Systems on Multicore Platforms

    Ecologists need to model ecosystems to predict how they will evolve over time. Since ecosystems are non-deterministic phenomena, they must express the likelihood of events occurring, and measure the uncertainty of their models' predictions. One method well suited to these demands is Population Dynamics P systems (PDP systems, in short), which is a formal framework based on multienvironment probabilistic P systems. In this paper, we show how to parallelize a Population Dynamics P system simulator, used to model biological systems, on multi-core processors, such as the Intel i5 Nehalem and i7 Sandy Bridge. We compare three different techniques, discuss their strengths and weaknesses, and evaluate their performance on two generations of Intel processors with large memory sub-system differences. We show that P systems are memory bound computations and that future performance optimization efforts should focus on memory traffic reductions. We achieve runtime gains of up to 2.5x by using all the cores of a single socket 4-core Intel i7 built on the Sandy Bridge architecture. From our analysis of these results we identify further ways to improve the runtime of our simulator. Funding: Junta de Andalucía P08-TIC04200; Ministerio de Educación y Ciencia TIN2009-1319